This assignment is to be submitted via email by the designated group representative and follows the Hyndman version of the problems assigned in batch #1. The first batch consists of 12 problems, drawn from both the Kuhn & Johnson (KJ) and Hyndman & Athanasopoulos (HA) texts. The problems to be addressed are as follows:
Working through these problems requires a thorough understanding of the material covered in the specified chapters.
Read the data into R
Select one of the time series as follows (but replace the column name with your own chosen column):
Explore your chosen retail time series using the following functions:
autoplot(), ggseasonplot(), ggsubseriesplot(), gglagplot(), ggAcf()
Can you spot any seasonality, cyclicity and trend? What do you learn about the series?
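A sketch of these exploration steps, using a simulated monthly series as a stand-in (replace `myts` with your own chosen retail column, e.g. built from `retail.xlsx`):

```r
library(fpp2)  # provides the gg* plotting helpers

# Stand-in monthly series with trend + seasonality (replace with your data)
set.seed(1)
myts <- ts(100 + 1:120 + 10 * sin(2 * pi * (1:120) / 12) + rnorm(120, sd = 3),
           frequency = 12, start = c(1982, 4))

autoplot(myts)         # level, trend, and seasonality at a glance
ggseasonplot(myts)     # one line per year, plotted over the months
ggsubseriesplot(myts)  # one mini-series per month
gglagplot(myts)        # y_t versus y_{t-k} for several lags k
ggAcf(myts)            # autocorrelation function
```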
The autoplot shows strong seasonality in the data, as well as an upward trend. Though there is a brief dip from 1990 to 2000, there is no evidence yet that this is part of a cycle.
The seasonal plot emphasizes the seasonality of the data. Sales start to rise in the fall before spiking sharply between November and December, then fall off after January, coinciding with holiday shopping and Christmas sales.
Again, the subseries plot highlights the seasonality of the data, but presents it more clearly than the seasonal plot. Though sales rise from September, the floor actually remains the same; the only real difference is in December, which has not only a higher ceiling but a higher floor as well.
The data are not very readable in the lag plot. We can see some negative relationships and some positive relationships, but the number of panels, and the fact that the data are monthly, make it difficult to discern much.
The slow decrease in the ACF as the lag increases reflects the trend, while the scalloped shape shows the seasonality of the sales data.
## time year quarter gilts profit s_per_herdsz production herdsz
## 1 1967.00 1967 1 105 8.075 10.80 2645 703
## 2 1967.25 1967 2 119 7.819 9.16 2540 722
## 3 1967.50 1967 3 119 7.366 9.38 2565 738
## 4 1967.75 1967 4 109 8.113 10.39 2776 747
## 5 1968.00 1968 1 117 7.380 9.44 2725 755
## 6 1968.25 1968 2 135 7.134 8.69 2623 780
##
## Forecast method: Simple exponential smoothing
##
## Model Information:
## Simple exponential smoothing
##
## Call:
## ses(y = production_ts, h = 4)
##
## Smoothing parameters:
## alpha = 0.9999
##
## Initial states:
## l = 2645.2967
##
## sigma: 144.1414
##
## AIC AICc BIC
## 666.9711 667.5165 672.5847
##
## Error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 12.43225 141.1065 110.1921 0.3188315 3.565448 0.3549497
## ACF1
## Training set -0.003580523
##
## Forecasts:
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## Jan 1971 3241.984 3057.259 3426.709 2959.472 3524.496
## Feb 1971 3241.984 2980.757 3503.211 2842.472 3641.496
## Mar 1971 3241.984 2922.053 3561.915 2752.692 3731.276
## Apr 1971 3241.984 2872.563 3611.405 2677.003 3806.965
From the output, we observe that alpha = 0.9999 and sigma = 144.1414.
## Manually computed 95% prediction interval:
## Lower Bound: 2959.47
## Upper Bound: 3524.5
Comparing the manually computed interval (point forecast ± 1.96σ) with the one produced by R shows that they agree to within rounding. This verification step confirms consistency with the SES model's prediction intervals and helps validate the reliability of the interval reported for the first forecasted value.
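The manual bounds can be reproduced directly from the values in the `ses()` output above:

```r
# Manual 95% prediction interval for the first SES forecast:
# point forecast +/- 1.96 * sigma (values taken from the ses() output)
point_fc <- 3241.984
sigma    <- 144.1414

lower <- point_fc - 1.96 * sigma
upper <- point_fc + 1.96 * sigma
round(c(Lower = lower, Upper = upper), 2)
# Lower: 2959.47, Upper: 3524.50 -- matching Lo 95 / Hi 95 above
```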
## [1] 2724.992
Based on the provided forecasts and intervals, the January 1971 forecast from our custom SES function (2724.992) does not match the forecast (3241.984) produced by R's ses() function, which suggests a bug in the manual implementation or a different choice of alpha and l0.
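For reference, a minimal correct SES one-step forecaster looks like the sketch below (the function name `my_ses` is ours). With alpha ≈ 0.9999 the final level essentially tracks the last observation, so a correct implementation applied to the production series should land near 3242 rather than 2725:

```r
# Minimal SES: l_t = alpha * y_t + (1 - alpha) * l_{t-1};
# the one-step-ahead forecast is the final smoothed level.
my_ses <- function(y, alpha, l0) {
  level <- l0
  for (t in seq_along(y)) {
    level <- alpha * y[t] + (1 - alpha) * level
  }
  level
}

# Toy check with alpha = 0.5 and l0 = 10:
my_ses(c(10, 20, 30), alpha = 0.5, l0 = 10)
#> [1] 22.5
```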
Modify your function from the previous exercise to return the sum of squared errors rather than the forecast of the next observation. Then use the optim() function to find the optimal values of alpha and l0. Do you get the same values as the ses() function?
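A sketch of the SSE version and the optim() call (the function name `ses_sse` and the simulated series are ours; for the assignment, `y` would be `production_ts`):

```r
# SSE version of SES: one-step errors e_t = y_t - l_{t-1},
# accumulated before the level is updated.
ses_sse <- function(par, y) {
  alpha <- par[1]
  level <- par[2]  # l0
  sse <- 0
  for (t in seq_along(y)) {
    e <- y[t] - level
    sse <- sse + e^2
    level <- alpha * y[t] + (1 - alpha) * level
  }
  sse
}

# Toy check: alpha = 0.5, l0 = 10 on c(10, 20, 30)
ses_sse(c(0.5, 10), c(10, 20, 30))
#> [1] 325

# Jointly optimize alpha and l0 over the SSE surface
# (simulated data here; optim's default Nelder-Mead is unconstrained,
# so alpha is not forced into [0, 1]):
set.seed(42)
y <- as.numeric(arima.sim(list(ar = 0.8), n = 100)) + 50
opt <- optim(c(0.5, y[1]), ses_sse, y = y)
opt$par  # estimated alpha and l0
```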
##
## Forecast method: Simple exponential smoothing
##
## Model Information:
## Simple exponential smoothing
##
## Call:
## ses(y = production_ts)
##
## Smoothing parameters:
## alpha = 0.9999
##
## Initial states:
## l = 2645.2967
##
## sigma: 144.1414
##
## AIC AICc BIC
## 666.9711 667.5165 672.5847
##
## Error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 12.43225 141.1065 110.1921 0.3188315 3.565448 0.3549497
## ACF1
## Training set -0.003580523
##
## Forecasts:
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## Jan 1971 3241.984 3057.259 3426.709 2959.472 3524.496
## Feb 1971 3241.984 2980.757 3503.211 2842.472 3641.496
## Mar 1971 3241.984 2922.053 3561.915 2752.692 3731.276
## Apr 1971 3241.984 2872.563 3611.405 2677.003 3806.965
## May 1971 3241.984 2828.961 3655.007 2610.320 3873.648
## Jun 1971 3241.984 2789.541 3694.427 2550.033 3933.935
## Jul 1971 3241.984 2753.291 3730.677 2494.593 3989.375
## Aug 1971 3241.984 2719.551 3764.417 2442.991 4040.977
## Sep 1971 3241.984 2687.860 3796.108 2394.525 4089.443
## Oct 1971 3241.984 2657.887 3826.081 2348.685 4135.283
## [1] "Optimal alpha (optim()): 0.813489953565518"
## [1] "Optimal l0 (optim()): 3149.70117501139"
The values obtained from optim() for \(α\) and \(l_0\) are not the same as those obtained directly from R's ses() function. This discrepancy suggests that optim() converged to a different set of parameters minimizing the sum of squared errors, possibly a local minimum or a consequence of the starting values, rather than the estimates produced by ses()'s own fitting routine.
Consider austa, the total international visitors to Australia (in millions) for the period 1980-2015.
Step 1: Use auto.arima() to Find an Appropriate ARIMA Model
## Series: austa
## ARIMA(0,1,1) with drift
##
## Coefficients:
## ma1 drift
## 0.3006 0.1735
## s.e. 0.1647 0.0390
##
## sigma^2 = 0.03376: log likelihood = 10.62
## AIC=-15.24 AICc=-14.46 BIC=-10.57
Step 2: Check That the Residuals Look Like White Noise
##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,1,1) with drift
## Q* = 2.297, df = 6, p-value = 0.8905
##
## Model df: 1. Total lags used: 7
Step 3: Plot Forecasts from the ARIMA Model for the Next 10 Periods
The ARIMA(0,1,1) model with drift seems to provide a reasonable fit to the austa series, with non-significant autocorrelation in the residuals and relatively low information criteria values.
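The three steps above can be sketched in a few lines, assuming fpp2 (which attaches the forecast package and the austa data) is loaded:

```r
library(fpp2)  # attaches forecast and the austa series

fit <- auto.arima(austa)         # Step 1: selects ARIMA(0,1,1) with drift
checkresiduals(fit)              # Step 2: residual plots + Ljung-Box test
autoplot(forecast(fit, h = 10))  # Step 3: forecasts for the next 10 periods
```

The Ljung-Box statistic printed by `checkresiduals()` is the same test available in base R as `Box.test(residuals(fit), type = "Ljung-Box")`.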
The transformation is unhelpful because, while it can help stabilize the variance, it does not account for other features such as seasonality and trend. Other techniques, such as differencing or seasonal decomposition, would be required before (or in addition to) a Box-Cox transformation.
Plot the time series of sales of product A. Can you identify seasonal fluctuations and/or a trend-cycle?
There are clear seasonal fluctuations that recur at the same time each year, along with a trend-cycle.
Use a classical multiplicative decomposition to calculate the trend-cycle and seasonal indices.
Do the results support the graphical interpretation from part a? Yes: the decomposed plot confirms the seasonal fluctuations and the trend-cycle.
Compute and plot the seasonally adjusted data.
Change one observation to be an outlier (e.g., add 500 to one observation), and recompute the seasonally adjusted data. What is the effect of the outlier?
Does it make any difference if the outlier is near the end rather than in the middle of the time series? Yes, if more recent observations are weighted more heavily: an outlier near the end then has a larger impact, and predictability suffers when the most heavily weighted data contain an outlier.
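The decomposition and outlier experiment can be sketched as follows, using a simulated monthly series as a stand-in (the product A data are not reproduced here):

```r
# Stand-in monthly sales series: trend times a 12-month seasonal pattern
sales <- ts((100 + 1:60) * rep(c(0.8, 0.9, 1.0, 1.1, 1.2, 1.0,
                                 0.9, 1.0, 1.1, 1.2, 1.3, 1.5), 5),
            frequency = 12)

# Classical multiplicative decomposition and seasonal adjustment
dec <- decompose(sales, type = "multiplicative")
seasadj_sales <- sales / dec$seasonal

# Outlier experiment: add 500 to one observation and recompute
sales_out <- sales
sales_out[30] <- sales_out[30] + 500  # outlier in the middle of the series
dec_out <- decompose(sales_out, type = "multiplicative")
seasadj_out <- sales_out / dec_out$seasonal

# Compare the two seasonally adjusted series
plot(cbind(adjusted = seasadj_sales, with_outlier = seasadj_out))
```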
Question:
Use the help function to explore what the series gold, woolyrnq and gas represent.
Answer:
Let’s use the help function to explore the data series ‘gold’, ‘woolyrnq’ and ‘gas’
This data series represents the daily morning gold prices in US dollars from 1 January 1985 to 31 March 1989.
This data series tracks the Quarterly production of woollen yarn in Australia in tonnes from Mar 1965 to Sep 1994.
This data series tracks Australian monthly gas production from 1956 to 1995.
Here, we will use ‘autoplot’ to plot the time series separately
Observation:
Gold Price: The price of gold generally rises day over day until approximately day 770, where it experiences a significant spike. After this peak, the price trends downward until about day 1000.
Woolen Yarn Production: The production of woolen yarn exhibits a seasonal pattern, characterized by regular fluctuations with many ups and downs.
Gas Production: Gas production shows a seasonal pattern and a noticeable upward trend, particularly after 1970.
## [1] 1
## [1] 4
## [1] 12
To summarize, here is the frequency of each series:
gold: daily data, but frequency() returns 1 because the series is stored without a seasonal period (a full annual cycle of daily data would imply a frequency of about 365).
woolyrnq: quarterly data, so the frequency is 4.
gas: monthly data, so the frequency is 12.
## [1] 770
## [1] 593.7
The outlier in the gold series was the spike that occurred on day 770, with the gold price reaching 593.7 US dollars.
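The outlier was located with `which.max()`, which returns the index of the largest observation; the same idiom on a toy vector:

```r
# For the gold series (loaded with fpp2), which.max(gold) gives 770
# and gold[which.max(gold)] gives 593.7, as shown above.
x <- c(3, 9, 2, 15, 4)
which.max(x)        # index of the maximum
#> [1] 4
x[which.max(x)]     # the maximum value itself
#> [1] 15
```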
Question:
The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe.
Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.
Do there appear to be any outliers in the data? Are any predictors skewed?
Are there any relevant transformations of one or more predictors that might improve the classification model?
Answer:
First, we need to load the data set using the library mlbench
## 'data.frame': 214 obs. of 10 variables:
## $ RI : num 1.52 1.52 1.52 1.52 1.52 ...
## $ Na : num 13.6 13.9 13.5 13.2 13.3 ...
## $ Mg : num 4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
## $ Al : num 1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
## $ Si : num 71.8 72.7 73 72.6 73.1 ...
## $ K : num 0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
## $ Ca : num 8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
## $ Ba : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Fe : num 0 0 0 0 0 0.26 0 0 0 0.11 ...
## $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
This data frame has 214 observations of 10 variables. We can print the first six rows of the data frame.
## RI Na Mg Al Si K Ca Ba Fe Type
## 1 1.52101 13.64 4.49 1.10 71.78 0.06 8.75 0 0.00 1
## 2 1.51761 13.89 3.60 1.36 72.73 0.48 7.83 0 0.00 1
## 3 1.51618 13.53 3.55 1.54 72.99 0.39 7.78 0 0.00 1
## 4 1.51766 13.21 3.69 1.29 72.61 0.57 8.22 0 0.00 1
## 5 1.51742 13.27 3.62 1.24 73.08 0.55 8.07 0 0.00 1
## 6 1.51596 12.79 3.61 1.62 72.97 0.64 8.07 0 0.26 1
To visualize the distributions of the predictor variables, we can use boxplots for the numerical variables.
Overall observations:
RI, K, Si, and Ba show a relatively narrow range across glass types.
Some elements, such as Na and Mg, exhibit greater variability.
A correlation heatmap can also be useful to understand the relationships between the predictor variables.
There is a strong positive correlation between Ca and RI, while Si and RI, for instance, are negatively correlated.
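The heatmap rests on a plain correlation matrix; a sketch of the computation (the `mlbench` loading lines mirror the setup above, and the toy vectors illustrate what a strong positive correlation looks like):

```r
# For the Glass data:
# library(mlbench); data(Glass)
# cor_mat <- cor(Glass[, 1:9])          # 9x9 predictor correlations
# corrplot::corrplot(cor_mat)           # heatmap of cor_mat

# cor() itself is base R; two near-linearly related variables:
set.seed(2)
ca <- rnorm(100)
ri <- 2 * ca + rnorm(100, sd = 0.1)  # ri mostly determined by ca
cor(ca, ri)                          # close to +1
```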
Based on the boxplots for the Glass dataset, outliers are observed in several predictor variables like RI, Mg, Si, and Ca for Type 2 glass.
Let’s check for skewness using histograms
The predictor Mg is skewed to the left, while K, Ba, and Fe are skewed to the right.
A logarithmic transformation, for instance, can compress the range of values and reduce the impact of outliers.
Similarly, a Box-Cox transformation can select an appropriate power transformation to normalize a distribution.
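A quick sketch of the effect, using simulated right-skewed data in place of K, Ba, or Fe (the `skewness` helper here is a simple moment-based version, written out to keep the example self-contained):

```r
# Sample skewness: third central moment over sd cubed
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

set.seed(3)
x <- rexp(500)       # right-skewed, like K, Ba, or Fe

skewness(x)          # clearly positive
skewness(log1p(x))   # much closer to 0 after log(1 + x)
```

A Box-Cox transformation (e.g. via caret's preprocessing or forecast::BoxCox) generalizes this by estimating the power parameter from the data.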
Question:
Figure 8.31 shows the ACFs for 36 random numbers, 360 random numbers and 1,000 random numbers. Explain the differences among these figures. Do they all indicate that the data are white noise?
Figure 8.31: Left: ACF for a white noise series of 36 numbers. Middle: ACF for a white noise series of 360 numbers. Right: ACF for a white noise series of 1,000 numbers.
Why are the critical values at different distances from the mean of zero? Why are the autocorrelations different in each figure when they each refer to white noise?
Answer:
The differences between these figures lie in the heights of the spikes and the widths of the critical-value bounds. As the number of random numbers increases (from 36 to 360 to 1,000), the spikes in the ACF plots become shorter and the bounds narrower. None of the plots shows a significant autocorrelation pattern, so all three indicate white noise.
The distances of the critical values from the mean of zero reflect the precision of the autocorrelation estimates: the bounds are approximately ±1.96/√T, so they shrink as the sample size T grows. The autocorrelations themselves differ across the figures because each is estimated from a different random sample; for white noise they are all estimates of zero, scattering less as T increases.
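The ±1.96/√T bounds for the three sample sizes work out as:

```r
# ACF critical values shrink with the series length T
round(1.96 / sqrt(c(36, 360, 1000)), 3)
#> [1] 0.327 0.103 0.062
```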
Answer:
To illustrate how the plot changes as we vary phi1, let's simulate and plot two scenarios with different values of phi1.
Higher values of phi1 lead to stronger autocorrelation and smoother patterns.
Changing theta1 to -0.6:
We observe that positive values of theta1 create a smoother process with more persistence.
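These simulations can be produced with `arima.sim()` from base R's stats package (the specific phi1 values 0.2 and 0.9 below are our illustrative choices):

```r
# AR(1) with weak vs. strong autocorrelation
set.seed(4)
ar_small <- arima.sim(list(ar = 0.2), n = 100)  # phi1 = 0.2: nearly noise-like
ar_large <- arima.sim(list(ar = 0.9), n = 100)  # phi1 = 0.9: smooth, persistent

# MA(1) with positive vs. negative theta1
ma_pos <- arima.sim(list(ma =  0.6), n = 100)
ma_neg <- arima.sim(list(ma = -0.6), n = 100)

par(mfrow = c(2, 2))
plot(ar_small, main = "AR(1), phi1 = 0.2")
plot(ar_large, main = "AR(1), phi1 = 0.9")
plot(ma_pos,  main = "MA(1), theta1 = 0.6")
plot(ma_neg,  main = "MA(1), theta1 = -0.6")
```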
…